Abstract

We propose a new model for parallel speedup that is based
on two parameters, the average parallelism of a program
and its variance in parallelism. We present a way to use
the model to estimate these program characteristics using
only observed speedup curves (as opposed to the more
detailed program knowledge otherwise required). We apply
this method to speedup curves from real programs on
a variety of architectures and show that the model fits
the observed data well. We propose several applications
for the model, including the selection of cluster sizes for
parallel jobs.

1  Introduction 

Speedup  models  describe  the  relationship  between  cluster 
size  and  execution  time  for  a  parallel  program.  These 
models  are  useful  for: 

Modeling parallel workloads: Many simulation studies use
a speedup model to generate a stochastic workload. Since
our model captures the behavior of many real programs,
it lends itself to a realistic workload model.

Summarizing program behavior: If a program has run
before (maybe on a range of cluster sizes), we can record
past execution times and use a speedup model to summarize
the historical data and estimate future execution times.
These estimates are useful for scheduling and allocation.

Inference of program characteristics: The parameters of
our model correspond to measurable program characteristics.
Thus we hypothesize that we can infer these characteristics
by fitting our model to an observed speedup curve and
finding the parameters that yield the best fit.

*EECS - Computer Science Division, University of California,
Berkeley, CA 94720 and San Diego Supercomputer Center, P.O.
Box 85608, San Diego, CA 92186. Supported by NSF grant
ASC-89-02825 and by Advanced Research Projects Agency/ITO,
Distributed Object Computation Testbed, ARPA order No. D570,
Issued by ESC/ENS under contract #F19628-96-C-0020. email:
downey@sdsc.edu, http://www.sdsc.edu/~downey

Our speedup model is a non-linear function of two parameters:
A, which is the average parallelism of a job, and σ, which
approximates the square of the coefficient of variation of
parallelism (Section 2.2). The family of curves described by
this model spans the theoretical space of speedup curves.
In [7], Eager, Zahorjan and Lazowska derive upper and lower
bounds for the speedup of a program on various cluster sizes
(subject to simplifying assumptions about the program's
behavior). When σ = 0, our model matches the upper bound;
as σ approaches infinity, our model approaches the lower
bound asymptotically.

This model might be used differently for different
applications. In [5] and [6] we use it to generate the
stochastic workload we use to evaluate allocation strategies
for malleable¹ jobs. For that application, we choose the
parameters A and σ from distributions and use them to
generate speedup curves. In this paper, we work the other
way around: we use observed speedup curves to estimate
the parameters of real programs. Our goal here is to show
that this model captures the behavior of real programs
running on diverse parallel architectures. This technique
is also useful for summarizing the speedup curve of a job
and interpolating between speedup measurements.

1.1  Related  work 

In [4], Dowdy proposed a speedup model based on a program
with a sequential component of length c₁ and a perfectly
parallel component of length c₂. The execution time, T(n),
of such a program is T(n) = c₁ + c₂/n, where n is the
number of processors.

¹ A malleable job is a parallel program that can run on a range
of cluster sizes. The allocation strategy is the part of the scheduler
that chooses the cluster size for each malleable job.

Chiang et al. [3] derive from this a model of speedup
with the form S(n) = (1 + β)n/(n + β), where the parameter
β is a program characteristic that varies from 0 for
a sequential program to infinity for a program with linear
speedup. Several subsequent studies have been based on
this model [10] [14]. Brecht and Guha use a variation of
this model that imposes an upper bound on the speedup
of some jobs [1] [9].
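The step from Dowdy's run-time model to the speedup form of Chiang et al. can be checked numerically. The sketch below assumes β = c₂/c₁, which is what equating T(1)/T(n) with (1 + β)n/(n + β) implies; the function names and the sample values of c₁ and c₂ are illustrative and are not taken from [3] or [4].

```python
def dowdy_time(n, c1, c2):
    """Run time with sequential part c1 and perfectly parallel part c2."""
    return c1 + c2 / n


def chiang_speedup(n, beta):
    """Speedup form S(n) = (1 + beta) n / (n + beta)."""
    return (1 + beta) * n / (n + beta)


c1, c2 = 2.0, 30.0          # hypothetical component lengths
beta = c2 / c1              # assumption: beta is the parallel/sequential ratio

# Speedup computed directly from run times, next to the closed form.
comparison = [
    (n, dowdy_time(1, c1, c2) / dowdy_time(n, c1, c2), chiang_speedup(n, beta))
    for n in (1, 4, 16, 64)
]
for n, direct, closed in comparison:
    print(n, direct, closed)
```

Under this assumption a purely sequential program (c₂ = 0) gives β = 0, and β grows without bound as the sequential part vanishes, which matches the range described above.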

One problem with this model is that the parameter β
has little semantic content. Thus, it is not clear how to
use observations of a real program to find the value of β
or how to choose a distribution of values that describes
a real workload. As a result, workload models based on
Dowdy's speedup model have tended to overestimate the
parallelism available in codes executing in supercomputing
environments. With our model, we have been able to
use observations of the workload at the San Diego
Supercomputer Center to infer the parameters of real
workloads [5] [6].

Sevcik [16] and Ghosal et al. [8] have proposed alternative
models based on more detailed program information.
These models have many free parameters, and therefore
provide no way to infer program characteristics from
observed behavior. Furthermore, it would be difficult to
specify the range of these parameters in a real workload.

Smirni et al. [17] use a speedup model with the following
functional form: S(n) = (1 - r^n)/(1 - r) with
0 < r < 1. The motivation for this model is to facilitate
analysis. Again, the parameter r has no semantic content.

No prior study has demonstrated that a proposed model
describes the behavior of real programs. Chakrabarti et
al. [2] propose a model for efficiency of data parallel tasks;
they use measurements of ScaLAPACK programs to validate
this model.

Many of the allocation strategies that have been proposed
for malleable jobs assume that the scheduler knows
the average parallelism of all jobs [16] [8] [15] [17] [3] [12]
[1]. Thus all of these strategies require that the parallelism
profile of the program be known, or that A (and
maybe V) can be calculated by other means. Our model
may provide a way to derive these characteristics.


2  The  model 

The design goal for our speedup model is to find a family
of speedup curves that are parameterized by the average
parallelism of the program, A, and the variance in
parallelism, V. To do this, we construct a hypothetical
parallelism profile² with the desired values of A and V,
and then use this profile to derive speedups. We use two
families of profiles, one for programs with low variance,
the other for programs with high variance.

² The parallelism profile is the distribution of potential parallelism
of a program [16].

Figure 1: The parallelism profile for (a) the low variance
speedup model and (b) the high variance speedup model.


2.1  Low  variance  model  (σ ≤ 1)

Figure 1a shows a hypothetical parallelism profile for a
program with low variance in degree of parallelism. The
parallelism is equal to A, the average parallelism, for all
but some fraction σ of the duration (0 ≤ σ ≤ 1). The
remaining time is divided between a sequential component
and a high-parallelism component (with parallelism chosen
such that the average parallelism is A). The variance
of parallelism is V = σ(A - 1)².
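As a check on the stated variance, the sketch below builds one profile that is consistent with this description and computes its mean and variance directly. It assumes that the remaining fraction σ of the duration is split evenly between the sequential component and a component with parallelism 2A - 1 (the value that keeps the mean at A under an even split); the function names are illustrative.

```python
def low_variance_profile(A, sigma):
    """Return (duration, parallelism) segments; durations sum to 1.

    Assumption: the fraction sigma is split evenly between a sequential
    segment and a segment with parallelism 2A - 1.
    """
    return [(sigma / 2, 1.0), (1 - sigma, float(A)), (sigma / 2, 2.0 * A - 1)]


def mean_and_variance(profile):
    """Duration-weighted mean and variance of the degree of parallelism."""
    total = sum(d for d, _ in profile)
    mean = sum(d * p for d, p in profile) / total
    var = sum(d * (p - mean) ** 2 for d, p in profile) / total
    return mean, var


A, sigma = 64, 0.5
mean, var = mean_and_variance(low_variance_profile(A, sigma))
print(mean, var, sigma * (A - 1) ** 2)
```

The two segments that deviate from A sit at equal distance A - 1 on either side of the mean, which is why the variance reduces to σ(A - 1)².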

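A program with this profile would have the following
run time as a function of cluster size:

\[
T(n) =
\begin{cases}
\dfrac{A - \sigma/2}{n} + \dfrac{\sigma}{2} & 1 \le n \le A \\[1ex]
\dfrac{\sigma (A - 1/2)}{n} + 1 - \dfrac{\sigma}{2} & A \le n \le 2A - 1 \\[1ex]
1 & n \ge 2A - 1
\end{cases}
\tag{1}
\]

where n is the cluster size (number of processors). Thus
T(1) = A and T(∞) = 1. The speedup, S(n) = T(1)/T(n), is

\[
S(n) =
\begin{cases}
\dfrac{An}{A + \sigma (n - 1)/2} & 1 \le n \le A \\[1ex]
\dfrac{An}{\sigma (A - 1/2) + n (1 - \sigma/2)} & A \le n \le 2A - 1 \\[1ex]
A & n \ge 2A - 1
\end{cases}
\tag{2}
\]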


2.2  High  variance  model  (σ ≥ 1)

In the previous section, the parameter σ is bounded between
0 and 1, and thus the variance of the parallelism
profile is limited to V = (A - 1)² when σ = 1. In this
section, we propose an extended model in which σ can
exceed 1 and the variance is unbounded. The two models
can be combined naturally because (1) when the parameter
σ = 1, the two models are identical, and (2) for both
models the variance of the parallelism profile is σ(A - 1)².

From this latter property we derive the semantic content
of the parameter σ: it is approximately the square
of the coefficient of variation of parallelism, CV². This
approximation follows from the definition of coefficient of
variation, CV = √V / A. Thus, CV² is σ(A - 1)²/A²,
which for large A is approximately σ.

Figure 1b shows a hypothetical parallelism profile for
a program with high variance in parallelism. The profile
consists of a sequential component of duration σ and a
parallel component of duration 1 and potential parallelism
A + Aσ - σ. A program with this profile would have the
following run time as a function of cluster size:


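\[
T(n) =
\begin{cases}
\sigma + \dfrac{A + A\sigma - \sigma}{n} & 1 \le n \le A + A\sigma - \sigma \\[1ex]
\sigma + 1 & n \ge A + A\sigma - \sigma
\end{cases}
\tag{3}
\]

Thus T(1) = A(σ + 1) and T(∞) = σ + 1. The speedup is

\[
S(n) =
\begin{cases}
\dfrac{nA(\sigma + 1)}{\sigma (n + A - 1) + A} & 1 \le n \le A + A\sigma - \sigma \\[1ex]
A & n \ge A + A\sigma - \sigma
\end{cases}
\tag{4}
\]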
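The two speedup formulas can be written as a single function of n, A and σ. The following is a minimal sketch, not the author's code: it transcribes Equations 2 and 4 as reconstructed from the profiles described above, and the function names are illustrative. It also evaluates both branches at σ = 1, where the text says the two models coincide.

```python
def speedup_low(n, A, sigma):
    """Low-variance model (Equation 2), intended for sigma <= 1."""
    if n <= A:
        return A * n / (A + sigma * (n - 1) / 2)
    if n <= 2 * A - 1:
        return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
    return float(A)


def speedup_high(n, A, sigma):
    """High-variance model (Equation 4), intended for sigma >= 1."""
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def speedup(n, A, sigma):
    """Combined model: choose the branch from the value of sigma."""
    if sigma <= 1:
        return speedup_low(n, A, sigma)
    return speedup_high(n, A, sigma)


A = 64
# Largest disagreement between the two models at sigma = 1.
max_gap = max(abs(speedup_low(n, A, 1.0) - speedup_high(n, A, 1.0))
              for n in range(1, 300))
print(max_gap)
for sigma in (0.0, 0.5, 1.0, 2.0):
    print(sigma, [round(speedup(n, A, sigma), 2) for n in (1, 16, 64, 128)])
```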


Figure 2 shows a set of speedup curves for a range of
values of σ. When σ = 0 the curve matches the theoretical
upper bound for speedup, bounded at first by the "hardware
limit" (the 45 degree line) and then by the "software
limit" (the average parallelism A). As σ approaches infinity,
the curve approaches the theoretical lower bound on
speedup [7]:


\[
S_{\mathrm{low}}(n) = \frac{An}{A + n - 1}
\tag{5}
\]

Figure 2: Speedup curves for a range of values of σ.
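Both limiting cases can be checked numerically. The sketch below re-implements the modeled speedup (Equations 2 and 4) and compares it with min(n, A) at σ = 0 and with An/(A + n - 1) for a very large σ. The helper names and the choice of σ = 10⁹ as a stand-in for infinity are assumptions made for illustration.

```python
def speedup(n, A, sigma):
    """Modeled speedup from Equations 2 and 4."""
    if sigma <= 1:
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def upper_bound(n, A):
    """Hardware limit n, then software limit A."""
    return float(min(n, A))


def lower_bound(n, A):
    """Lower bound on speedup (Equation 5)."""
    return A * n / (A + n - 1)


A = 64
cluster_sizes = (1, 8, 64, 100, 500)
gap_upper = max(abs(speedup(n, A, 0.0) - upper_bound(n, A)) for n in cluster_sizes)
gap_lower = max(abs(speedup(n, A, 1e9) - lower_bound(n, A)) for n in cluster_sizes)
print(gap_upper, gap_lower)
```

For any fixed n, the modeled speedup does not increase as σ grows, so every curve in the family lies between these two bounds.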


2.3  Calculating  the  knee 

Several authors have proposed the idea that an optimal
allocation for a program is the one that maximizes the power,
Φ, which is defined as the product of the speedup and the
efficiency, e(n) = s(n)/n. Thus, Φ = s²/n. We search for
the value of n that maximizes Φ by finding local maxima,
where dΦ/dn = 0:

\[
\frac{d\Phi}{dn} = \frac{2ns\,\frac{ds}{dn} - s^2}{n^2} = 0
\qquad\Longrightarrow\qquad
2ns\,\frac{ds}{dn} = s^2
\tag{6}
\]

The speedup curves proposed in Equations 2 and 4 have
the functional form s(n) = κn/u(n), where κ is a constant
with respect to n, and u(n) is some function of n.
Substituting this functional form into Equation 6 yields:

\[
2n \cdot \frac{\kappa n}{u} \cdot \frac{u\kappa - n\kappa\,\frac{du}{dn}}{u^2}
= \frac{\kappa^2 n^2}{u^2}
\qquad\Longrightarrow\qquad
u = 2n\,\frac{du}{dn}
\tag{7}
\]


Then, using Equation 2, we can find the "knee" of the
low-variance speedup curve:

\[
u = \sigma (A - 1/2) + n (1 - \sigma/2), \qquad
\frac{du}{dn} = 1 - \sigma/2, \qquad
n^* = \frac{\sigma (A - 1/2)}{1 - \sigma/2}
\tag{8}
\]


where n* is the optimal cluster size. Using Equation 4 for
the high variance model:

\[
u = n\sigma + A(\sigma + 1) - \sigma, \qquad
\frac{du}{dn} = \sigma, \qquad
n^* = \frac{A(\sigma + 1) - \sigma}{\sigma}
\tag{9}
\]

Figure 3: Maximum power allocation vs. variance. The optimal
allocation for a range of values of σ. The average parallelism,
A, is 64.

Figure 4: Power as a function of cluster size for several values
of σ. The small circle on each curve indicates the point of
maximum power. The average parallelism, A, is 64.

Figure 3 shows the optimal cluster size, n*, as a function
of σ, with A fixed at 64. When σ is less than 2A/(3A - 1),
which is approximately 2/3, the point of maximum power
is n* = A, which is in accord with the heuristic that the
number of processors allocated to a job should be equal
to its average parallelism [7] [8]. It also agrees with the
colloquial interpretation of the "knee" of the curve: a
discontinuity in its slope.

But as the value of σ approaches 1, n* increases quickly
to 2A - 1. For larger values of σ, it decreases gradually
and approaches A - 1 asymptotically. This result is both
surprising and discouraging: surprising because it violates
the intuition that the optimal allocation for a job should
decline monotonically as the variance in parallelism
increases [16], and discouraging because the rapid change in
the point of maximum power suggests (1) that the "knee"
of the curve is not well-defined, since small changes in
observed speedups might lead to drastically different
allocations, and (2) that the assumption that the point of
maximum power is an optimal allocation is probably wrong. In
[5] we confirm that scheduling strategies that attempt to
allocate this "optimal" cluster size do not perform as well
as other strategies, including one that simply allocates A
processors to each job.

Figure 4 confirms that the local maxima from Equations
8 and 9 are in fact the global maxima over all feasible
cluster sizes.
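The behavior of n* described above can be reproduced with a short calculation. The sketch below evaluates the closed forms of Equations 8 and 9 (using n* = A below the threshold σ = 2A/(3A - 1)) and compares them with a brute-force search for the integer cluster size that maximizes the power s²/n. The function names and the search range of 4A are illustrative choices.

```python
def speedup(n, A, sigma):
    """Modeled speedup from Equations 2 and 4."""
    if sigma <= 1:
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def power(n, A, sigma):
    """Power: speedup times efficiency, s^2 / n."""
    s = speedup(n, A, sigma)
    return s * s / n


def knee(A, sigma):
    """Closed-form point of maximum power."""
    if sigma <= 1:
        if sigma <= 2 * A / (3 * A - 1):
            return float(A)                           # knee at n = A
        return sigma * (A - 0.5) / (1 - sigma / 2)    # Equation 8
    return (A * (sigma + 1) - sigma) / sigma          # Equation 9


def knee_by_search(A, sigma):
    """Integer cluster size with the largest power, by exhaustive search."""
    return max(range(1, 4 * A + 1), key=lambda n: power(n, A, sigma))


A = 64
table = [(sigma, knee(A, sigma), knee_by_search(A, sigma))
         for sigma in (0.0, 0.5, 0.7, 0.9, 1.0, 2.0, 10.0)]
for row in table:
    print(row)
```

The search works on integer cluster sizes, so it can differ from the closed form by rounding; the sharp rise of n* between σ ≈ 2/3 and σ = 1 is visible in both columns.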


3  Estimating  parameters 

Given a set of observed speedups s_i for cluster sizes n_i,
we would like to estimate values of the parameters A and
σ that minimize the sum of squared differences between
the observed values and the fitted values (calculated by
Equations 2 and 4). In other words, we would like to
minimize

\[
\chi^2(A, \sigma) = \sum_i \left[ s_i - s(n_i; A, \sigma) \right]^2
\tag{10}
\]

where s(n_i; A, σ) is the modeled speedup of a program
with average parallelism A and variance parameter σ,
running on n_i processors.
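Equation 10 translates directly into code. The sketch below is illustrative only: the observations are synthetic (generated from the model itself, with no noise) rather than measurements from this paper, and the function names are assumptions.

```python
def speedup(n, A, sigma):
    """Modeled speedup from Equations 2 and 4."""
    if sigma <= 1:
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def chi2(A, sigma, ns, ss):
    """Sum of squared differences between observed and modeled speedups."""
    return sum((s - speedup(n, A, sigma)) ** 2 for n, s in zip(ns, ss))


ns = [1, 2, 4, 8, 16, 32, 64]
ss = [speedup(n, 20, 0.5) for n in ns]   # synthetic, noise-free observations

exact = chi2(20, 0.5, ns, ss)    # parameters that generated the data
wrong = chi2(25, 0.5, ns, ss)    # a deliberately wrong average parallelism
print(exact, wrong)
```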

The Levenberg-Marquardt method [11] performs this
minimization by a variation of the inverse-Hessian
(Gauss-Newton) method that takes advantage of the form
of the objective function, χ².

Starting with an initial estimate for the parameters A
and σ, we iteratively calculate improved estimates based
on the value of χ² and its derivatives with respect to A
and σ. These derivatives are

\[
\frac{\partial \chi^2}{\partial A}
= -2 \sum_i \left[ s_i - s(n_i; A, \sigma) \right] \frac{\partial s}{\partial A},
\qquad
\frac{\partial \chi^2}{\partial \sigma}
= -2 \sum_i \left[ s_i - s(n_i; A, \sigma) \right] \frac{\partial s}{\partial \sigma}
\]

where the partial derivatives ∂s/∂A and ∂s/∂σ are derived from
Equations 2 and 4.

The Levenberg-Marquardt method chooses adaptively
between the inverse-Hessian method (which converges
quickly in the vicinity of the minimum) and the method
of steepest descent (which is robust when the objective
function is ill-behaved). In practice, we have found that this
method is robust and converges quickly for a variety of
speedup curves. We have used this method to estimate
parameters for many speedup curves reported from real
programs on a variety of architectures. The next section
discusses these results.
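The iteration can be sketched in a few dozen lines. What follows is a simplified damped least-squares loop in the spirit of Levenberg-Marquardt, not the routine from [11] and not the code used for the results in this paper. It approximates the partial derivatives by forward differences instead of deriving them analytically, and the starting point, step size h and damping schedule are all assumptions. The observations are synthetic.

```python
def speedup(n, A, sigma):
    """Modeled speedup from Equations 2 and 4."""
    if sigma <= 1:
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def chi2(params, ns, ss):
    A, sigma = params
    return sum((s - speedup(n, A, sigma)) ** 2 for n, s in zip(ns, ss))


def fit(ns, ss, start, iters=200, h=1e-6):
    """Damped least-squares iteration for the two parameters (A, sigma)."""
    A, sigma = start
    lam = 1e-3                        # damping factor
    best = chi2((A, sigma), ns, ss)
    for _ in range(iters):
        # Residuals and forward-difference partial derivatives of s.
        r, jA, jS = [], [], []
        for n, s in zip(ns, ss):
            s0 = speedup(n, A, sigma)
            r.append(s - s0)
            jA.append((speedup(n, A + h, sigma) - s0) / h)
            jS.append((speedup(n, A, sigma + h) - s0) / h)
        # Damped 2x2 normal equations, solved by Cramer's rule.
        a11 = sum(x * x for x in jA) * (1 + lam)
        a22 = sum(x * x for x in jS) * (1 + lam)
        a12 = sum(x * y for x, y in zip(jA, jS))
        b1 = sum(x * ri for x, ri in zip(jA, r))
        b2 = sum(x * ri for x, ri in zip(jS, r))
        det = a11 * a22 - a12 * a12
        if det == 0:
            break
        trial = (A + (b1 * a22 - b2 * a12) / det,
                 sigma + (a11 * b2 - a12 * b1) / det)
        try:
            c = chi2(trial, ns, ss) if trial[0] >= 1 else float("inf")
        except ZeroDivisionError:
            c = float("inf")
        if c < best:                  # accept: lean toward Gauss-Newton
            (A, sigma), best, lam = trial, c, lam / 10
        else:                         # reject: lean toward steepest descent
            lam *= 10
    return (A, sigma), best


ns = [1, 2, 4, 8, 16, 32, 64]
ss = [speedup(n, 20, 0.5) for n in ns]    # synthetic observations
start = (15.0, 0.8)                        # hypothetical initial estimate
(A_fit, sigma_fit), best = fit(ns, ss, start)
print(A_fit, sigma_fit, best)
```

A trial step is kept only if it lowers χ², so the returned value is never worse than the starting one. Because the model is piecewise, the derivatives jump where a cluster size crosses a breakpoint; a production fit would more likely rely on a tested library routine.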

4  Fitting  observed  speedups 

We  developed  our  speedup  model  with  two  applications 
in  mind: 

Workload modeling: We would like to know whether
our model captures the behavior of parallel scientific
applications, and if so, what values of the parameters
A and σ are typical of the workloads in supercomputing
environments.

Summarizing program behavior: For purposes of
scheduling parallel jobs, we can use a database of past
execution times to predict future execution times. If
our model fits observed speedup curves well, then we
can greatly compress this database by recording, for
each program and problem size, only the two parameters
A and σ.

In order to evaluate the usefulness of our model, we have
collected reported speedup curves from numerous scientific
applications running on a variety of parallel architectures.
In general, we have found that our model is capable
of summarizing the behavior of most real programs. In the
next two sections, we present some of this data, point out
cases in which our model fails, and suggest ways of dealing
with these failures.

4.1  NAS  benchmarks 

The Numerical Aerospace Simulation Facility (NAS) at
NASA Ames Research Center has compiled a suite of
benchmarks intended to be representative of computational
fluid-dynamic codes. The original NAS Parallel
Benchmarks (NPB 1) were algorithmic specifications of
eight computations. NPB 2 consists of implementations
of four of those computations in Fortran 77 with MPI.
NAS has reported timings of these codes on four different
distributed-memory computers with three different problem
sizes. The programs are:


LU: Solves Navier-Stokes equations in 3-D using LU
decomposition and symmetric successive over-relaxation (SSOR).
Due to the internal structure of the code, it requires
power-of-two cluster sizes.

SP and BT: Both solve Navier-Stokes equations in 3-D,
based on a Beam-Warming factorization. In SP,
the resulting system is scalar pentadiagonal; in BT,
it is block tridiagonal. In both cases, the system
is solved by Gaussian elimination without pivoting.
The decomposition used (3-D multipartitioning) requires
cluster sizes that are perfect squares.

MG: Solves Poisson's equation using a V-cycle multigrid
algorithm. This code requires power-of-two cluster
sizes.

The codes were run on an IBM SP2, an SGI Power Challenge
Array, a Cray Research T3D and an Intel Paragon.
The three problem sizes are named Class A (the smallest),
Class B and Class C (the largest). For more details
about the benchmarks and test architectures, see [13] and
http://www.nas.nasa.gov/NAS/NPB.

Figure 5 shows the speedups observed on the SP2 and
our estimated parameters and speedup curves. In nearly
all cases, the fitted curve matches the observed data well.
The only exception is the surprisingly bad performance
of the SP class C benchmark on 36 processors. This datum
may be in error, or it may be a consequence of a
memory-system phenomenon like a cache collision.

Memory requirements prevent some Class C benchmarks
from running on small cluster sizes. In these cases
we normalize the observed speedup values before estimating
parameters; in effect, we treat the smallest feasible
cluster size as a single processing unit. Thus, the efficiency
on the smallest cluster size is defined to be 1. We
have observed that this normalization does not degrade
the goodness of fit of the model.
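One reading of this normalization is sketched below: both the cluster sizes and the speedups are rescaled so that the smallest feasible cluster size counts as one processing unit with speedup 1. This is an assumption about how the normalization is carried out, the timings are hypothetical, and the function name is illustrative. The input lists are assumed to be sorted by increasing cluster size.

```python
def normalize(ns, times):
    """Rescale so that the smallest cluster size counts as one unit.

    ns    -- cluster sizes in increasing order
    times -- measured run times at those cluster sizes
    Returns (units, speedups), with units[0] == 1 and speedups[0] == 1.
    """
    n0, t0 = ns[0], times[0]
    units = [n / n0 for n in ns]          # cluster size in multiples of n0
    speedups = [t0 / t for t in times]    # speedup relative to the n0 run
    return units, speedups


ns = [8, 16, 32, 64]                      # hypothetical feasible cluster sizes
times = [100.0, 52.0, 28.0, 16.0]         # hypothetical run times in seconds
units, speedups = normalize(ns, times)
efficiencies = [s / u for s, u in zip(speedups, units)]
print(units, speedups, efficiencies)
```

With this rescaling, a fitted A is expressed in units of the smallest cluster size rather than in individual processors.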

For each benchmark, we expect the estimated average
parallelism, A, to increase with problem size. In fact, this
is true of LU, MG and SP, but not true of BT. The estimated
value of A for class C is smaller than that for
class B. This example illustrates one failure mode of our
model: if the observed timings exhibit linear or near-linear
speedup, then our model has no way of inferring the average
parallelism of the code. Any value of A (greater than
the largest observed speedup) would yield the same goodness
of fit. It happens that the curve-fitting technique
we use converges on A = s_max, where s_max is the largest
observed speedup.
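This failure mode is easy to reproduce. In the sketch below, the observations are synthetic and perfectly linear (s = n), and σ is held at 0. Every candidate A at or above the largest observed speedup then gives a sum of squared errors of exactly zero, so the data cannot distinguish between them. The function names and the candidate values are illustrative.

```python
def speedup(n, A, sigma):
    """Modeled speedup from Equations 2 and 4."""
    if sigma <= 1:
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def chi2(A, sigma, ns, ss):
    """Sum of squared differences between observed and modeled speedups."""
    return sum((s - speedup(n, A, sigma)) ** 2 for n, s in zip(ns, ss))


ns = [1, 2, 4, 8, 16, 32]
ss = [float(n) for n in ns]               # synthetic: perfectly linear speedup

# Candidates at or above the largest observed speedup all fit equally well.
errors = {A: chi2(A, 0.0, ns, ss) for A in (32, 64, 128, 1024)}
too_small = chi2(16, 0.0, ns, ss)         # A below the largest speedup
print(errors, too_small)
```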

Figure 6 shows the speedups observed on the T3D. Because
the amount of memory per node is smaller on this
machine than on the SP2, none of the benchmarks were
able to run on a single processor. Thus all of our curve
fits are based on normalized data.

Unlike the SP2, which can allocate arbitrary cluster
sizes, the T3D can only allocate power-of-two cluster sizes.
As a result, the performance of BT and SP, which require
square cluster sizes, is erratic. For example, SP class B
runs significantly slower on 196 processors than on 144. In
both cases, the actual allocated cluster size is 256 processors,
so it is not clear why using fewer of the allocated
processors results in better performance. Because our model
does not capture the behavior of these programs, it may
not be an appropriate choice for modeling a power-of-two
machine.

The MG benchmark exhibits another behavior that our
model does not capture: superlinear speedup. Many
programs perform poorly on small cluster sizes because
the performance of the memory system degrades as the
amount of data per processor increases. For these programs,
the speedup curve exceeds the theoretical upper
bound of our model.

Fortunately, we can detect this behavior easily: a negative
estimated σ indicates superlinear speedup. Thus, our
model can be used to set a lower bound on the cluster size
for a job: if σ is negative, we discard observations from
the low end of the range until it is positive. It is probably
not desirable to allocate a smaller cluster size, since the
job would run inefficiently.

There is one other behavior that our model does not
capture: non-monotonic speedup curves. On large cluster
sizes, the communication overhead for some programs
eventually overwhelms the benefits of additional parallelism
and the speedup curve begins to decline. Of course,
there is little value in modeling this behavior, since it is
never desirable for a job to allocate a large cluster if
it runs faster on a smaller one. If we observe declining
speedups, we can discard observations from the upper end
of the range until the curve is monotonic, and impose an
upper bound on the cluster size that may be allocated.
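The trimming rule for declining speedups can be sketched as follows. The observations are hypothetical, the function name is illustrative, and taking the largest retained cluster size as the allocation bound is one plausible reading of the rule rather than a procedure stated in the text.

```python
def trim_to_monotonic(ns, ss):
    """Drop observations from the upper end until speedup never declines.

    ns -- cluster sizes in increasing order
    ss -- observed speedups at those cluster sizes
    """
    ns, ss = list(ns), list(ss)
    while any(later < earlier for earlier, later in zip(ss, ss[1:])):
        ns.pop()
        ss.pop()
    return ns, ss


ns = [1, 2, 4, 8, 16, 32, 64]
ss = [1.0, 1.9, 3.6, 6.5, 10.0, 12.0, 11.0]   # hypothetical: declines at 64

kept_ns, kept_ss = trim_to_monotonic(ns, ss)
n_upper = kept_ns[-1]      # assumed upper bound on the cluster size
print(kept_ns, n_upper)
```

The retained points can then be passed to the parameter estimation of Section 3.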

4.2  SPLASH-2  programs 

We obtained speedup curves for the SPLASH-2 programs
running on a simulated shared-memory computer [19] [18].
The Stanford Parallel Applications for Shared memory
(SPLASH) suite consists of 8 complete programs and 4
computation kernels that are intended to span a wide
range of scientific applications. These include LU
decomposition, a ray-tracing program, an ocean model, an
n-body solver, and more. For details, see
http://www-flash.stanford.edu.

For each program, we obtained the measured speedup
on 6 cluster sizes (2, 4, 8, 16, 32 and 64 processors). The
speedup on one processor is defined to be one. Figure 7
shows these observed speedups and the speedup model we
estimated for each program. In each case, we observe that
the fitted model is a good match for the observed data.

Each graph is labeled with the name of the benchmark
and the estimated parameters A and σ. Two of the programs
exhibit nearly linear speedup. Others have much
more limited parallelism; the lowest value of A is 20.3. The
value of σ is generally less than 1, although two programs
yielded estimates of σ = 1.7 and σ = 2.7. The distribution
of σ is similar on the NAS benchmarks.

5  Conclusions 

•  Our proposed speedup model captures the behavior of
numerous scientific applications running on a variety
of parallel computers, both shared- and distributed-memory.
Thus, we feel that this model is a realistic
choice for modeling parallel workloads.

•  The parameters of our model correspond to measurable
program characteristics. Thus, our observations
of these benchmarks give us some insight into the
values and distributions of these parameters in a real
workload.

5.1  Future  work 

We have suggested that we can infer the program
characteristics A and σ of a program by fitting our model to
observed speedups. We have shown that our model fits
observed speedups well; thus it would be appropriate to use
these estimated parameters for allocation and scheduling.
But we have not demonstrated that the estimated parameters
necessarily reflect the actual program characteristics
as they might be derived from a known parallelism profile.
We have identified cases in which they do not, and these
cases suggest restrictions on when and how this approach
will be successful. In future work, we plan to clarify these
restrictions and determine how meaningful the estimated
parameters really are.
Figure 5: Speedup curves for the NAS benchmarks run on the IBM SP2 at NAS. Each row reports the results of one
benchmark on three different problem sizes. Class A is the smallest problem size; Class C is the largest. The gray lines
show the theoretical upper limit (σ = 0) and lower limit (σ = ∞) for a program with the given average parallelism, A.
Figure 6: Speedup curves for the NAS benchmarks run on the Cray T3D at NAS. Each row reports the results of one
benchmark on three different problem sizes. Class A is the smallest problem size; Class C is the largest.
Figure 7: Speedup curves for the SPLASH-2 benchmark run on a simulated shared-memory machine.